{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# PCA as a Factor Model - Coding Exercises\n",
    "\n",
    "\n",
    "### Introduction\n",
    "\n",
    "As we learned in the previous lessons, we can use PCA to create a factor model of risk. Our risk factor model represents the return as:\n",
    "\n",
    "$$\n",
    "\\textbf{r} = \\textbf{B}\\textbf{f} + \\textbf{s}\n",
    "$$\n",
    "\n",
    "where $\\textbf{r}$ is a matrix containing the asset returns, $\\textbf{B}$ is a matrix representing the factor exposures, $\\textbf{f}$ is the matrix of factor returns, and $\\textbf{s}$ is the idiosyncratic risk (also known as the company specific risk).\n",
    "\n",
    "In this notebook, we will use real stock data to calculate:\n",
    "\n",
    "* The Factor Exposures (Factor Betas) $\\textbf{B}$\n",
    "* The Factor Returns $\\textbf{f}$\n",
    "* The Idiosyncratic Risk Matrix $\\textbf{S}$\n",
    "* The Factor Covariance Matrix $\\textbf{F}$\n",
    "\n",
    "We will then combine these quantities to create our Risk Model. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Install Packages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "!{sys.executable} -m pip install -r requirements.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Get Returns\n",
    "\n",
    "In this notebook, we will get the stock returns using Zipline and data from Quotemedia, just as we learned in previous lessons. The function `get_returns(start_date, end_date)` in the `utils` module, gets the data from the Quotemedia data bundle and produces the stock returns for the given `start_date` and `end_date`. You are welcome to take a look at the `utils` module to see how this is done.\n",
    "\n",
    "In the code below, we use `utils.get_returns` funtion to get the returns for stock data between `2011-01-05` and `2016-01-05`. You can change the start and end dates, but if you do, you have to make sure the dates are valid trading dates. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import utils\n",
    "\n",
    "# Get the returns for the fiven start and end date. Both dates must be valid trading dates\n",
    "returns = utils.get_returns(start_date='2011-01-05', end_date='2016-01-05')\n",
    "\n",
    "# Display the first rows of the returns\n",
    "returns.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# TODO: Factor Exposures\n",
    "\n",
    "In the code below, write a function, `factor_betas(pca, factor_beta_indices, factor_beta_columns)` that calculates the factor exposures from Scikit-Learn's `PCA()` class. Remember the matrix of factor exposures, $\\textbf{B}$, describes the coordintates of the Principal Components in the original basis. The `pca` parameter must be a Scikit-Learn's pca object, that has fit the model with the returns. In other words, you must first run `pca.fit(returns)` before passing this parameter into the function. Later in this notebook we will create a function, `fit_pca()`, that will fit the pca model and return the `pca`object. The `factor_beta_indices` parameter must be a 1 dimensional ndarray containg the column names of the `returns` dataframe. The `factor_beta_columns` parameter must be a 1 dimensional ndarray containing evenly spaced integers from 0 up to the number of principal components you used in your `pca` model minus one. For example, if you used 5 principal compoenents in your `pca` model, `pca = PCA(n_components = 5)`, then `factor_beta_columns = [0, 1, 2, 3, 4]`. This function has to return a Pandas dataframe with the factor exposures, where the `factor_beta_indices` correspond to the indices of the dataframe and the `factor_beta_columns` correspond to the column names of the dataframe. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def factor_betas(pca, factor_beta_indices, factor_beta_columns):\n",
    "\n",
    "    #Implement Function\n",
    "    \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# TODO: Factor Retuns\n",
    "\n",
    "In the code below, write a function, `factor_returns(pca, returns, factor_return_indices, factor_return_columns)` that calculates the factor returns from Scikit-Learn's `PCA()` class. Remember the matrix of factor returns, $\\textbf{f}$, represents the `returns` written in the **new** basis. The `pca` parameter must be a Scikit-Learn's pca object, that has fit the model with the returns. In other words, you must first run `pca.fit(returns)` before passing this parameter into the function. Later in this notebook we will create a function, `fit_pca()`, that will fit the pca model and return the `pca`object. The `returns` parameter is the pandas dataframe of returns given at the begining of the notebook. The `factor_return_indices` parameter must be a 1 dimensional ndarray containing the trading dates (Pandas `DatetimeIndex`) in the `returns` dataframe. The `factor_return_columns` parameter must be a 1 dimensional ndarray containing evenly spaced integers from 0 up to the number of principal components you used in your `pca` model minus one. For example, if you used 5 principal compoenents in your `pca` model, `pca = PCA(n_components = 5)`, then `factor_beta_columns = [0, 1, 2, 3, 4]`. This function has to return a Pandas dataframe with the factor returns, where the `factor_return_indices` correspond to the indices of the dataframe and the `factor_return_columns` correspond to the column names of the dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd \n",
    "\n",
    "def factor_returns(pca, returns, factor_return_indices, factor_return_columns):\n",
    "    \n",
    "    #Implement Function\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# TODO: Idiosyncratic Risk Matrix\n",
    "\n",
    "Let's review how we can calculate the Idiosyncratic Risk Matrix $\\textbf{S}$. We know that: \n",
    "\n",
    "$$\n",
    "\\textbf{s} = \\textbf{r} - \\textbf{B}\\textbf{f}\n",
    "$$\n",
    "\n",
    "We refer to $\\textbf{s}$ as the residuals. To calculate the idiosyncratic or specific risk matrix $\\textbf{S}$, we have to calculate the covariance matrix of the residuals, $\\textbf{s}$, and set the off-diagonal elements to zero. \n",
    "\n",
    "With this in mind, in the code below cerate a function, `idiosyncratic_var_matrix(returns, factor_returns, factor_betas, ann_factor)` that calclates the **annualized** Idiosyncratic Risk Matrix. The `returns` parameter is the pandas dataframe of returns given at the begining of the notebook. The `factor_returns` parameter is the output of the `factor_returns()` function created above. Similarly, the `factor_betas` parameter is the output of the `factor_betas()` function created above. The `ann_factor` parameter is an integer representing the annualization factor. \n",
    "\n",
    "Remember that if the `returns` time series are daily returns, then when we calculate the Idiosyncratic Risk Matrix we will get values on a daily basis. We can annualize these values simply by multiplying the whole Idiosyncratic Risk Matrix by an annualization factor of 252. Remember we don't need the square root of the factor because our numbers here are variances not standard deviations.\n",
    "\n",
    "The function must return a pandas dataframe with the annualized Idiosyncratic Risk Matrix containing the covariance of the residuals in its main diagonal and with all the off-diagonal elements set to zero. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def idiosyncratic_var_matrix(returns, factor_returns, factor_betas, ann_factor):\n",
    "    \n",
    "    #Implement Function\n",
    "    \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# TODO: Factor Covariance Matrix\n",
    "\n",
    "To calculate the annualized factor covariance matrix, $\\textbf{F}$, we use the following equation:\n",
    "\n",
    "$$\n",
    "\\textbf{F} = \\frac{1}{N -1}\\textbf{f}\\textbf{f}^T\n",
    "$$\n",
    "\n",
    "where, $N$ is the number of elements in $\\textbf{f}$. Recall that the factor covariance matrix, $\\textbf{F}$, is a diagonal matrix.\n",
    "\n",
    "With this in mind, create a function, `factor_cov_matrix(factor_returns, ann_factor)` that calculates the annualized factor covariance matrix from the factor returns $\\textbf{f}$. The `factor_returns` parameter is the output of the `factor_returns()` function created above and the `ann_factor` parameter is an integer representing the annualization factor. The function must return a diagonal numpy ndarray \n",
    "\n",
    "**HINT :** You can calculate the factor covariance matrix $\\textbf{F}$ very easily using Numpy's `.var` method. The $\\frac{1}{N -1}$ factor can be taken into account using the `ddof` keyword. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def factor_cov_matrix(factor_returns, ann_factor):\n",
    " \n",
    "    #Implement Function\n",
    "    \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# TODO: Perfom PCA\n",
    "\n",
    "In the code below, create a function, `fit_pca(returns, num_factor_exposures, svd_solver)` that uses Scikit-Learn's `PCA()` class to fit the `returns` dataframe with the given number of `num_factor_exposures` (Principal Components) and with the given `svd_solver`. The `returns` parameter is the pandas dataframe of returns given at the begining of the notebook. The `num_factor_exposures` parameter is an integer representing the number of Principal Components you want to use in your PCA algorithm. The `svd_solver` parameter is a string that determines the type of solver you want to use in your PCA algorithm. To see the type of solvers that you can use, see the [Scikit-Learn documentation](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html). The function must fit the `returns` and return the `pca` object. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.decomposition import PCA\n",
    "\n",
    "def fit_pca(returns, num_factor_exposures, svd_solver):\n",
    "\n",
    "    #TODO: Implement function\n",
    "    \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# TODO: Create The Risk Model\n",
    "\n",
    "\n",
    "In the code below, create a class:\n",
    "\n",
    "```python\n",
    "class RiskModel(object):\n",
    "    def __init__(self, returns, ann_factor, num_factor_exposures, pca):\n",
    "```\n",
    "\n",
    "where the `returns` parameter is the pandas dataframe of returns given at the begining of the notebook. The `ann_factor` parameter is an integer representing the annualization factor. The `num_factor_exposures` parameter is an integer representing the number of Principal Components you want to use in your PCA algorithm. The `pca` parameter is the output of the `fit_pca()` function created above. The class must contain all the fucntions created above. For example, to include the Factor covariance matrix we will use:\n",
    "\n",
    "```python\n",
    "self.factor_cov_matrix_ = factor_cov_matrix(self.factor_returns_, ann_factor)\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "class RiskModel(object):\n",
    "    \n",
    "    #Implement class\n",
    "  \n",
    "\n",
    "# Set the annualized factor\n",
    "ann_factor = \n",
    "\n",
    "# Set the number of factor exposures (principal components) for the PCA algorithm\n",
    "num_factor_exposures = \n",
    "\n",
    "# Set the svd solver for the PCA algorithm\n",
    "svd_solver = 'full'\n",
    "\n",
    "# Fit the PCA Model using the fit_pca() fucntion \n",
    "pca = \n",
    "\n",
    "# Create a RiskModel object\n",
    "rm = "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# TODO: Print The Factor Exposures"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Display the Factor Exposures\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# TODO: Print The Factor Returns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Display the Factor Returns\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# TODO: Print The Idiosyncratic Risk Matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Display the Idiosyncratic Risk Matrix\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# TODO: Print The Factor Covariance Matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Display the Factor Covariance Matrix\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# View The Percent of Variance Explained by Each Factor"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# Set the default figure size\n",
    "plt.rcParams['figure.figsize'] = [10.0, 6.0]\n",
    "\n",
    "# Make the bar plot\n",
    "plt.bar(np.arange(num_factor_exposures), pca.explained_variance_ratio_);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can see that the first factor dominates. The precise defintion of each factor in a latent model is unknown, however we can guess at the likely intepretation.\n",
    "\n",
    "# View The Factor Returns\n",
    "\n",
    "Remember that the factors returns don't necessarily have direct interpretations in the real world but you can thinik of them as returns time series for some kind of latent or unknown driver of return variance. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# Set the default figure size\n",
    "plt.rcParams['figure.figsize'] = [10.0, 6.0]\n",
    "\n",
    "rm.factor_returns_.loc[:,0:5].cumsum().plot();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Solution\n",
    "\n",
    "[Solution notebook](pca_factor_model_solution.ipynb)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
